Chen-Hsuan Lin

Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

May 05, 2025

Lu Ling, Chen-Hsuan Lin, Tsung-Yi Lin, Yifan Ding, Yu Zeng, Yichen Sheng, Yunhao Ge, Ming-Yu Liu, Aniket Bera, Zhaoshuo Li

Abstract:Synthesizing interactive 3D scenes from text is essential for gaming, virtual reality, and embodied AI. However, existing methods face several challenges. Learning-based approaches depend on small-scale indoor datasets, limiting the scene diversity and layout complexity. While large language models (LLMs) can leverage diverse text-domain knowledge, they struggle with spatial realism, often producing unnatural object placements that fail to respect common sense. Our key insight is that vision perception can bridge this gap by providing realistic spatial guidance that LLMs lack. To this end, we introduce Scenethesis, a training-free agentic framework that integrates LLM-based scene planning with vision-guided layout refinement. Given a text prompt, Scenethesis first employs an LLM to draft a coarse layout. A vision module then refines it by generating an image guidance and extracting scene structure to capture inter-object relations. Next, an optimization module iteratively enforces accurate pose alignment and physical plausibility, preventing artifacts like object penetration and instability. Finally, a judge module verifies spatial coherence. Comprehensive experiments show that Scenethesis generates diverse, realistic, and physically plausible 3D interactive scenes, making it valuable for virtual content creation, simulation environments, and embodied AI research.

Dynamic Camera Poses and Where to Find Them

Apr 24, 2025

Chris Rockwell, Joseph Tung, Tsung-Yi Lin, Ming-Yu Liu, David F. Fouhey, Chen-Hsuan Lin

Abstract:Annotating camera poses on dynamic Internet videos at scale is critical for advancing fields like realistic video generation and simulation. However, collecting such a dataset is difficult, as most Internet videos are unsuitable for pose estimation. Furthermore, annotating dynamic Internet videos presents significant challenges even for state-of-the-art methods. In this paper, we introduce DynPose-100K, a large-scale dataset of dynamic Internet videos annotated with camera poses. Our collection pipeline addresses filtering using a carefully combined set of task-specific and generalist models. For pose estimation, we combine the latest techniques of point tracking, dynamic masking, and structure-from-motion to achieve improvements over the state-of-the-art approaches. Our analysis and experiments demonstrate that DynPose-100K is both large-scale and diverse across several key attributes, opening up avenues for advancements in various downstream applications.

* Accepted to CVPR 2025. Project Page: https://research.nvidia.com/labs/dir/dynpose-100k

Cosmos World Foundation Model Platform for Physical AI

Jan 07, 2025

NVIDIA: Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen (+69 more)

Abstract:Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.

Edify 3D: Scalable High-Quality 3D Asset Generation

Nov 11, 2024

NVIDIA: Maciej Bala, Yin Cui, Yifan Ding, Yunhao Ge, Zekun Hao, Jon Hasselgren, Jacob Huffman, Jingyi Jin (+15 more)

Abstract:We introduce Edify 3D, an advanced solution designed for high-quality 3D asset generation. Our method first synthesizes RGB and surface normal images of the described object at multiple viewpoints using a diffusion model. The multi-view observations are then used to reconstruct the shape, texture, and PBR materials of the object. Our method can generate high-quality 3D assets with detailed geometry, clean shape topologies, high-resolution textures, and materials within 2 minutes of runtime.

* Project website: https://research.nvidia.com/labs/dir/edify-3d

Neuralangelo: High-Fidelity Neural Surface Reconstruction

Jun 12, 2023

Zhaoshuo Li, Thomas Müller, Alex Evans, Russell H. Taylor, Mathias Unberath, Ming-Yu Liu, Chen-Hsuan Lin

Abstract:Neural surface reconstruction has been shown to be powerful for recovering dense 3D surfaces via image-based neural rendering. However, current methods struggle to recover detailed structures of real-world scenes. To address the issue, we present Neuralangelo, which combines the representation power of multi-resolution 3D hash grids with neural surface rendering. Two key ingredients enable our approach: (1) numerical gradients for computing higher-order derivatives as a smoothing operation and (2) coarse-to-fine optimization on the hash grids controlling different levels of details. Even without auxiliary inputs such as depth, Neuralangelo can effectively recover dense 3D surface structures from multi-view images with fidelity significantly surpassing previous methods, enabling detailed large-scale scene reconstruction from RGB video captures.

* CVPR 2023, project page: https://research.nvidia.com/labs/dir/neuralangelo
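
The "numerical gradients" ingredient in the abstract above can be illustrated with central finite differences. The following is a minimal sketch, not the authors' implementation: `sphere_sdf` is a toy stand-in for a learned signed distance function, and the function names and the step size `eps` are hypothetical choices made for illustration. A larger `eps` samples the field over a wider neighborhood, which is the sense in which a numerical gradient can act as a smoothing operation.

```python
import numpy as np

def sphere_sdf(p, radius=1.0):
    # Toy signed distance function (assumption): distance to a sphere at the origin.
    return np.linalg.norm(p, axis=-1) - radius

def numerical_gradient(f, p, eps):
    # Central finite differences along each coordinate axis.
    # f maps points of shape (..., D) to values of shape (...).
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.shape[-1]):
        offset = np.zeros(p.shape[-1])
        offset[i] = eps
        grad[..., i] = (f(p + offset) - f(p - offset)) / (2.0 * eps)
    return grad

points = np.array([[0.0, 0.0, 2.0], [1.0, 1.0, 0.0]])
coarse = numerical_gradient(sphere_sdf, points, eps=0.5)   # wide step, smoother estimate
fine = numerical_gradient(sphere_sdf, points, eps=1e-3)    # small step, close to the analytic gradient
```

For the toy sphere, the estimate at (0, 0, 2) is close to the unit vector (0, 0, 1) for either step size, while the two estimates differ more at points where the field curves more strongly relative to the step.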

ATT3D: Amortized Text-to-3D Object Synthesis

Jun 06, 2023

Jonathan Lorraine, Kevin Xie, Xiaohui Zeng, Chen-Hsuan Lin, Towaki Takikawa, Nicholas Sharp, Tsung-Yi Lin, Ming-Yu Liu, Sanja Fidler, James Lucas

Abstract:Text-to-3D modelling has seen exciting progress by combining generative text-to-image models with image-to-3D methods like Neural Radiance Fields. DreamFusion recently achieved high-quality results but requires a lengthy, per-prompt optimization to create 3D objects. To address this, we amortize optimization over text prompts by training on many prompts simultaneously with a unified model, instead of separately. With this, we share computation across a prompt set, training in less time than per-prompt optimization. Our framework - Amortized text-to-3D (ATT3D) - enables knowledge-sharing between prompts to generalize to unseen setups and smooth interpolations between text for novel assets and simple animations.

* 22 pages, 20 figures

Magic3D: High-Resolution Text-to-3D Content Creation

Nov 18, 2022

Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, Tsung-Yi Lin

Abstract:DreamFusion has recently demonstrated the utility of a pre-trained text-to-image diffusion model to optimize Neural Radiance Fields (NeRF), achieving remarkable text-to-3D synthesis results. However, the method has two inherent limitations: (a) extremely slow optimization of NeRF and (b) low-resolution image space supervision on NeRF, leading to low-quality 3D models with a long processing time. In this paper, we address these limitations by utilizing a two-stage optimization framework. First, we obtain a coarse model using a low-resolution diffusion prior and accelerate with a sparse 3D hash grid structure. Using the coarse representation as the initialization, we further optimize a textured 3D mesh model with an efficient differentiable renderer interacting with a high-resolution latent diffusion model. Our method, dubbed Magic3D, can create high-quality 3D mesh models in 40 minutes, which is 2x faster than DreamFusion (reportedly taking 1.5 hours on average), while also achieving higher resolution. User studies show that 61.7% of raters prefer our approach over DreamFusion. Together with the image-conditioned generation capabilities, we provide users with new ways to control 3D synthesis, opening up new avenues to various creative applications.

* Project website: https://deepimagination.cc/Magic3D

BARF: Bundle-Adjusting Neural Radiance Fields

Apr 13, 2021

Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, Simon Lucey

Abstract:Neural Radiance Fields (NeRF) have recently gained a surge of interest within the computer vision community for their power to synthesize photorealistic novel views of real-world scenes. One limitation of NeRF, however, is its requirement of accurate camera poses to learn the scene representations. In this paper, we propose Bundle-Adjusting Neural Radiance Fields (BARF) for training NeRF from imperfect (or even unknown) camera poses -- the joint problem of learning neural 3D representations and registering camera frames. We establish a theoretical connection to classical image alignment and show that coarse-to-fine registration is also applicable to NeRF. Furthermore, we show that naïvely applying positional encoding in NeRF has a negative impact on registration with a synthesis-based objective. Experiments on synthetic and real-world data show that BARF can effectively optimize the neural scene representations and resolve large camera pose misalignment at the same time. This enables view synthesis and localization of video sequences from unknown camera poses, opening up new avenues for visual localization systems (e.g. SLAM) and potential applications for dense 3D mapping and reconstruction.
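
The coarse-to-fine idea in the abstract above can be sketched as a positional encoding whose frequency bands are switched on gradually. This is an illustrative sketch under stated assumptions, not the authors' code: the cosine window, the progress parameter `alpha`, the frequency scaling, and all function names are hypothetical choices that the abstract does not specify.

```python
import numpy as np

def coarse_to_fine_weights(alpha, num_freqs):
    # Per-band weights in [0, 1] (assumed cosine window):
    # band k is off while alpha < k, ramps up smoothly, and is fully on once alpha >= k + 1.
    k = np.arange(num_freqs)
    t = np.clip(alpha - k, 0.0, 1.0)
    return (1.0 - np.cos(np.pi * t)) / 2.0

def positional_encoding(x, num_freqs, alpha):
    # x has shape (..., D); the result has shape (..., D + 2 * D * num_freqs).
    x = np.asarray(x, dtype=float)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi      # assumed frequency scaling
    angles = x[..., None] * freqs                      # (..., D, num_freqs)
    w = coarse_to_fine_weights(alpha, num_freqs)
    enc = np.concatenate([w * np.sin(angles), w * np.cos(angles)], axis=-1)
    return np.concatenate([x, enc.reshape(*x.shape[:-1], -1)], axis=-1)

pts = np.random.default_rng(0).uniform(-1.0, 1.0, size=(5, 3))
early = positional_encoding(pts, num_freqs=4, alpha=0.0)   # only the raw coordinates carry signal
late = positional_encoding(pts, num_freqs=4, alpha=4.0)    # all frequency bands fully enabled
```

Increasing `alpha` from 0 to `num_freqs` over the course of optimization lets low frequencies guide registration first and adds high-frequency detail later.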

SDF-SRN: Learning Signed Distance 3D Object Reconstruction from Static Images

Oct 20, 2020

Chen-Hsuan Lin, Chaoyang Wang, Simon Lucey

Abstract:Dense 3D object reconstruction from a single image has recently witnessed remarkable advances, but supervising neural networks with ground-truth 3D shapes is impractical due to the laborious process of creating paired image-shape datasets. Recent efforts have turned to learning 3D reconstruction without 3D supervision from RGB images with annotated 2D silhouettes, dramatically reducing the cost and effort of annotation. These techniques, however, remain impractical as they still require multi-view annotations of the same object instance during training. As a result, most experimental efforts to date have been limited to synthetic datasets. In this paper, we address this issue and propose SDF-SRN, an approach that requires only a single view of objects at training time, offering greater utility for real-world scenarios. SDF-SRN learns implicit 3D shape representations to handle arbitrary shape topologies that may exist in the datasets. To this end, we derive a novel differentiable rendering formulation for learning signed distance functions (SDF) from 2D silhouettes. Our method outperforms the state of the art under challenging single-view supervision settings on both synthetic and real-world datasets.

* Accepted to NeurIPS 2020. Project page & code: https://chenhsuanlin.bitbucket.io/signed-distance-SRN/

Deep NRSfM++: Towards 3D Reconstruction in the Wild

Jan 27, 2020

Chaoyang Wang, Chen-Hsuan Lin, Simon Lucey

Abstract:The recovery of 3D shape and pose solely from 2D landmarks stemming from a large ensemble of images can be viewed as a non-rigid structure from motion (NRSfM) problem. To date, however, the application of NRSfM to problems in the wild has been problematic. Classical NRSfM approaches do not scale to large numbers of images and can only handle certain types of 3D structure (e.g. low-rank). A recent breakthrough in this problem has allowed for the reconstruction of a substantially broader set of 3D structures, dramatically expanding the approach's importance to many problems in computer vision. However, the approach is still limited in that (i) it cannot handle missing/occluded points, and (ii) it is applicable only to weak-perspective camera models. In this paper, we present Deep NRSfM++, an approach to allow NRSfM to be truly applicable in the wild by offering up innovative solutions to the above two issues. Furthermore, we demonstrate state-of-the-art performance across numerous benchmarks, even against recent methods based on deep neural networks.
